Machine learning-based cluster analysis of immune cell subtypes and breast cancer survival

Host immunity involves various immune cells working in concert to achieve balanced immune response. Host immunity interacts with tumorigenic process impacting disease outcome. Clusters of different immune cells may reveal unique host immunity in relation to breast cancer progression. CIBERSORT algorithm was used to estimate relative abundances of 22 immune cell types in 3 datasets, METABRIC, TCGA, and our study. The cell type data in METABRIC were analyzed for cluster using unsupervised hierarchical clustering (UHC). The UHC results were employed to train machine learning models. Kaplan–Meier and Cox regression survival analyses were performed to assess cell clusters in association with relapse-free and overall survival. Differentially expressed genes by clusters were interrogated with IPA for molecular signatures. UHC analysis identified two distinct immune cell clusters, clusters A (83.2%) and B (16.8%). Memory B cells, plasma cells, CD8 positive T cells, resting memory CD4 T cells, activated NK cells, monocytes, M1 macrophages, and resting mast cells were more abundant in clusters A than B, whereas regulatory T cells and M0 and M2 macrophages were more in clusters B than A. Patients in cluster A had favorable survival. Similar survival associations were also observed in other independent studies. IPA analysis showed that pathogen-induced cytokine storm signaling pathway, phagosome formation, and T cell receptor signaling were related to the cell type clusters. Our finding suggests that different immune cell clusters may indicate distinct immune responses to tumor growth, suggesting their potential for disease management.

www.nature.com/scientificreports/immunotherapy 17,18 .TNBC patients with high tumor infiltrating lymphocytes (TIL) are more responsive to ICI, whereas those with hormone receptor-positive breast tumors and low TIL are less responsive 19 .This discrepancy in TIL is explained in part by the differences in somatic mutations which not only reprogram cell signal pathways and metabolisms, but also generate tumor-associated and tumor-specific antigens (TAA, TSA) 20,21 .These altered or mutant molecules induce host immune response by attracting immune cell infiltration and congregation.Characterizing the abundance and composition of immune cell subtypes in tumor samples has shown values in disease prognosis and prediction of treatment responses 22,23 .
Cell sorting by flow cytometry and tissue staining with immunohistochemistry have been used to assess TIL, but these methods have some limitations with respect to tissue accessibility, processing challenges, and subjective evaluation 24 .Recently, computational approaches have been developed for in silico prediction of immune cell subtype abundances based on the readily available gene expression data on tissue transcriptomes.To assess if immune cell subtype clusters are useful for breast cancer prognosis, we analyzed transcriptomic data from several breast cancer datasets using the computation algorithm CYBERSORT 25 .The results of our analyses are presented in this report.
UHC analysis indicated two clusters of immune cell subtypes in METABRIC (Supplementary Fig. 1).One cluster (hcluster 1 or cluster A) was observed in 1113 patients (83.2%), and another (hcluster 2 or cluster B) was in 224 patients (16.8%).Differences in cell subtypes between the two clusters and their comparisons with normal breast tissues are shown in Table 1.Cell subtypes which were significantly different between the two clusters included memory B cells, plasma cells, CD8 positive T cells, resting memory CD4 T cells, activated NK cells, monocytes, M1 macrophages, and resting mast cells, which showed higher abundances in cluster A than cluster B. Cell subtypes with relative abundances higher in cluster B than cluster A were regulatory T cells and M0 and M2 macrophages.
Immune cell subtype abundances were very different between normal breasts and breast tumors (Table 1).Compared to normal breasts, less abundant cell types in breast tumors included naïve B cells, resting CD4 memory T cells, resting NK cells, M2 macrophages, and resting mast cells; more abundant cell types in tumor samples were memory B cells, follicular helper T cells, gamma delta T cells, and M0 and M1 macrophages.Different abundances between clusters A and B tumor samples in comparison to normal breasts were plasma cells (higher in A, but lower in B), CD8 T cells (no difference in A, but lower in B), and activated NK cells (higher in A, but no difference in B).
Associations of immune cell clusters with clinical and pathological variables of breast cancer in METABRIC are shown in Table 2. Patients with ER negative tumors or invasive ductal carcinoma were more prevalent in cluster B than in cluster A, and patients in cluster B were also more likely to develop recurrent disease or die.

Immune cell cluster modeling
We used random forest (RF) to build a prediction model for cell subtype clusters.The RF model was trained with the UHC results in 60% of the METABRIC data, and the model fit well to the UHC clusters with 100% and 98% AUC in the training and testing sets, respectively (Supplementary Fig. 2).Although DNN, elastic net, and stepAIC models were also matched well to UHC, the AUC of RF in the training set was higher than that in other three models.Thus, we used the RF model to predict immune cell clusters in the Turin study and TCGA.The RF predicted cell clusters were analyzed for its associations with patient survival.Similar associations with relapse-free and overall survival were found in the Turin study (Fig. 2), i.e., cluster B associated with poor survival, although the associations were not statistically significant after adjusting for clinicopathological variables (Table 3).Associations between patient survival and immune cell clusters were also observed in TCGA.Patients with immune cell subtypes in cluster B had higher risks for disease recurrence and death compared to those with cell subtypes in cluster A (Fig. 2).The survival associations in TCGA were statistically significant after adjusting for clinicopathological variables (Table 3).No associations between immune cell clusters and ER status or histological types were observed in these validation studies (data not shown).
The importance of each cell type in the RF model was evaluated with mean decreases in accuracy and the Gini coefficient.The top 5 important cell types were M0 and M2 macrophages, CD8 positive T cells, activated NK cells, and resting mast cells (Supplementary Fig. 3).The stepAIC analysis showed a 19-cell model, and the elastic net suggested a 13-cell regression (Supplementary Table 1).Twelve cell types were common in both models, including naïve B cells, memory B cells, plasma cells, CD8 positive T cells, resting memory CD4 T cells, activated memory CD4 T cells, regulatory T cells, activated NK cells, M2 macrophages, resting mast cells, activated mast cells, and neutrophils.

IPA analysis on DEGs
There were 16,621 genes overlapping between the transcriptomic data of METABRIC and TCGA.IPA was performed on the 268 DEGs in TCGA (absolute log2 fold change at 1.2 or larger for cluster B versus cluster A; BH adjusted P < 0.05) (Fig. 3A).Since the expression data in METABRIC had a smaller range and the median www.nature.com/scientificreports/fold change was only 1.001 (IQR: 0.996-1.184),we used the absolute log2 fold change at 0.07 as a threshold and selected 306 DEGs for IPA analysis.Volcano plot showed the selected DEGs in METABRIC and TCGA (Fig. 3B,C).Graphical summary of IPA analysis on cell cluster associated DEGs showed that the transcription profiles were similar between METABRIC and TCGA, with most of the signal pathways being downregulated (Supplementary Fig. 4).The top 5 common signal pathways predicted by IPA in METABRIC and TCGA were pathogen induced cytokine storm signaling pathway, phagosome formation, T cell receptor signaling, T helper 1 pathway, and macrophage classical activity, all of which were downregulated (Fig. 4A).The T cell receptor signaling showed the similar patterns of network in METABRIC and TCGA (Fig. 4B,C).

Discussion
We used CIBERSORT to estimate the relative abundances of 22 immune cell subtypes in breast cancer and normal breast tissues and found significant differences in cell types between tumor and normal tissues.The deconvolution results on cell subtypes were further analyzed in breast cancer (METABRIC) with unsupervised hierarchical clustering, and the analysis suggested two distinct clusters of immune cell subtypes associated with different survival outcomes of breast cancer.These survival associations were replicated independently in our study (Turin) and TCGA when using a random forest model which was trained with the UHC classifications in METABRIC.The survival associations with immune cell clusters appeared to be independent from most known  26 .Their analysis showed 7 clusters in 6071 samples.The authors concluded that there were    between the two studies, but both studies indicate that breast cancer may be classified into immunity-based subtypes which have clinical implications in predicting disease prognosis and treatment response.Going through the cell types in each cluster, we found that memory B cells, plasma cells, CD8 positive T cells, resting memory CD4 T cells, activated NK cells, monocytes, M1 macrophages, and resting mast cells were significantly higher in the favorable cluster (cluster A), whereas regulatory T cells and M0 and M2 macrophages were substantially higher in unfavorable cluster (cluster B).These differentiating cell types appear to be consistent with the current understanding that hot or immune-inflamed TME, which has favorable prognosis and is responsive to immunotherapy, is infiltrated with cytotoxic T cells (CD8 positive T cells), NK cells, and M1 macrophages, whereas cold TME is filled with immunosuppressive lymphocytes like regulatory T cells and tumor-associated macrophages (TAM), M0 and M2 28 .NK cells and CD8 positive T cells are known to be able to suppress tumor growth through their cytotoxic activities 28,29 .Furthermore, CD4 memory T cells and M1 microphages facilitate the effects of NK cells and cytotoxic T cells 30 .Conversely, M2 macrophages and regulatory T cells inhibit the activities of CD4 memory and CD8 cytotoxic T cells, respectively 31 .
We analyzed the cell type data by focusing on immune cells in clusters instead of individual cells because host immunity is complex and involves different mechanisms and diverse cell lineages which give rise to innate versus adaptive, local versus systemic, and cellular versus humoral immunities.These distinct immune activities are carried out by a variety of cell types which work in concert to mount an appropriate immune response 32 .Thus, analyzing any single cell or a few cell types may not reveal enough insights into the interplay between tumor immunogenicity and host immune response as well as the potential impact of their interaction on tumor growth and disease outcome 33 .Ali et al. 26 assessed individual immune cell subtypes in relation to breast cancer survival by ER status, and the large study found that only two cell types showed consistent associations with survival outcomes, regulatory T cells and M2 macrophages, both of which were associated with poor survival.Although multiple cell type clusters were found in that study, the survival associations with some cell types were generally consistent with those observed in our cluster analysis.For example, Ali et al. 26 found favorable survival associations with monocytes and memory B cells in ER positive tumors and with CD8 positive T cells in ER negative tumors, as well as unfavorable survival associations with M0 macrophages for ER positive tumors.
TCR stimulation is a fundamental step in most T cell responses.TCR signaling is important for many aspects of T cell regulation, including development, differentiation, activation, proliferation, and survival.Dysregulation of TCR signaling can result in allergy and autoimmune diseases 34 .The molecular mechanism of TCR suppression underlying the link between immune cells in cluster B and breast cancer progression remains to be elucidated.
One limitation of our study is that we cannot assess the temporal and spatial variations of immune cell subtypes in tumor specimens, which is known to play an important role in determining the effect of host immunity and host-immune interplay in addition to cell types 35,36 .Anti-cancer therapies are known to have significant impacts on TME and immune cell infiltration 37 .Our analysis of immune cell subtypes in cluster only reflects the cell composition at the time of initial mastectomy which may be considered as a baseline status of TME that is different from those of post-surgery and during systemic anti-cancer treatment.The other limitation is that our deconvolution was not based on the entire 547 reference genes in LM22.Although not all signature matrix genes are required for deconvolution, the algorithm's performance is improved with the presence of more signature genes 38 .

Conclusions
This study applied different machine learning methods to analyze immune cell subtypes in clusters and found two distinct clusters in breast cancer associated with survival outcomes.The survival associations were replicated independently in two additional datasets.Immune cell subtypes which were more abundant in the cluster of favorable prognosis included memory B cells, plasma cells, CD8 positive T cells, resting memory CD4 T cells, activated NK cells, monocytes, M1 macrophages, and resting mast cells, and those less abundant were regulatory T cells, and M0 and M2 macrophages.The immune cell clusters associated with breast cancer progression may involve suppression of pathogen induced cytokine storm signaling pathway, phagosome formation, T cell receptor signaling, T helper 1 cell pathway, and macrophage classical activity pathways.Our finding suggests that immune cell clusters in primary breast cancer may be an important parameter to consider, in addition to individual cell types, when predicting disease outcome and planning treatment strategy.

Study design and participants
Two online datasets on transcriptome, METABRIC and TCGA 39,40 , were used for analysis together with their clinical and follow-up information.METABRIC, downloaded from cBioPortal (https:// www.cbiop ortal.org/) 41,42 , has 1903 breast tumor samples with gene expression data on 24,368 genes measured by a microarray chip from Illumine (Illumina HT-12 v3).The log2 intensity values were used for cell type deconvolution.Clinical data and survival information available for analysis in METABRIC include age at diagnosis, disease stage, tumor grade, histological type, estrogen receptor (ER) status, progesterone receptor (PR) status, ERBB2 (HER2) overexpression, disease recurrence, death, and follow-up time.TCGA RNA-seq data, expressed as fragments per kilobase of exon per million mapped fragments (FPKM), on 1075 breast tumor samples were downloaded from the Genomic Data Commons (GDC) data portal (https:// portal.gdc.cancer.gov/) 43 .The corresponding clinical information was downloaded from cBioPortal.
An independent dataset of tumor transcriptomes from 204 breast cancer patients was available from a previous study (Turin) of ours described in detail elsewhere 44 .In brief, we recruited 348 patients who were diagnosed with primary breast cancer and underwent mastectomies in the University Hospital at University of Turin in Italy 45 .Fresh tumor samples were collected during surgery and snap-frozen in liquid nitrogen immediately after resection.Total RNA was extracted, of which 205 were selected for microarray analysis using the Illumina Expression BeadChip (HumanRef-8 v1).The raw expression data (~ .idat)generated by the Illumina microarray assay were processed using GenomeStudio V2011.1.Data was normalized using the function neqc() in R package limma, This function performs normexp background correction using negative controls, then quantile normalizes and finally log2 transforms.The normalized data was ready for CIBERSORT deconvolution of 22 immune cell types 24,46 .Transcriptomic data on normal breast tissues were downloaded from the GTEx Portal (https:// www.gtexp ortal.org/ home/ datas ets) which contains the transcripts per million (TPM) of RNA-seq data on 459 tissue specimens (GTEx Analysis V8).
CIBERSORT estimates the relative abundances of immune cell subtypes in tissue samples.The computation algorithm deconvolutes 22 immune cell subtypes from tumor transcriptomes using reference LM22.LM22 includes the expression of 547 reference genes, of which 475 were available in METABRIC, 444 in the Turin study, 537 in TCGA, and 527 in the GTEx data.CIBERSORT interrogates tumor transcriptome for immune cell subtypes based on the assumption that tissue samples contain mixed cell populations 38 .To evaluate the validity of cell type deconvolution in METABRIC and TCGA, we selected 100 permutations as recommended to achieve statistical rigor without applying quantile normalization.Tumor samples with deconvolution results not significantly different from the null hypothesis (p > 0.05) were excluded from final analysis.The null hypothesis assumes no immune cell subtypes present in a tumor sample based on LM22.After removing the samples without significance, we obtained 1337 samples from METABRIC, 848 samples from TCGA, and 269 samples from GTEx qualified for cell type analysis.

Model development and statistical analysis
We performed unsupervised hierarchical clustering (UHC) analysis on the immune cell subtypes from META-BRIC using the 'hclust' function with 'complete' selection in R. Based on the UHC results, we created a dichotomous variable on cell subtype clusters.Differences in immune cell subtypes between clusters were compared using the Mann-Whitney nonparametric U test.Associations of cell subtype clusters with clinical and pathological variables were analyzed with the Chi-square test.Kaplan-Meier survival curves and log-rank test were used to evaluate survival differences between patients in different immune cell clusters.Cox proportional hazards regression analysis was performed to determine survival associations with immune cell clusters while adjusting for clinicopathological variables.Two-side p values < 0.05 were considered statistical significance.All the analyses were performed using R (version 4.0.5).
To predict cell subtype clusters, we tested 4 machine learning models, including random forest (RF), deep neural network (DNN), stepAIC, and elastic net.The models were initially trained based on the UHC results in 60% of METABRIC and then tested in the remaining 40% of the data.METABRIC data were randomly split into training and testing sets.The RF model was developed using the 'randomForest' package in R with 500-tree selection.The importance of immune cell subtypes in the model was evaluated by mean decrease in accuracy and the Gini coefficient.The DNN model was trained using the CPU implementation of TensorFlow (version 1.14.0) for 2000 steps with a 7 × 7 hidden layer in Python (version 3.6.13).Regression models of stepAIC and elastic net were developed using the "MASS" and "glmnet" packages in R (version 4.0.3) 47.Model comparison was made between UHC and each of the 4 machine learning methods using the "pROC" package in R which calculates the receiver operating characteristic (ROC) curves and area under the curve (AUC) 48 .DeLong's test was used for AUC comparison between models.We also evaluated immune cytolytic activity by calculating the CYT score 49 .
Wilcoxon test was performed for the differentially expressed genes (DEG) analysis between cluster A and cluster B (cluster B vs. cluster A) in METABRIC and TCGA.P values were adjusted for the Benjamin-Hochberg correction (BH).The ingenuity pathway analysis (IPA) (www.qiagen.com/ ingen uity) was performed on the significant DEGs to explore the signal pathways enriched in cell clusters.

Figure 3 .
Figure 3. (A) Venn diagram of total available genes and DEGs in METABRIC and TCGA; (B) Volcano plot of DEGs identified in METABRIC; (C) Volcano plot of DEGs identified in TCGA.FC: fold change.

Figure 4 .
Figure 4. (A) Comparison of IPA analysis between METABRIC and TCGA using the ingenuity pathway analysis (IPA, www.qiagen.com/ ingen uity); (B) T cell receptor signaling predicted IPA in METABRIC; (C) T cell receptor signaling predicted by IPA in TCGA.

Table 2 .
Associations between clinicopathological variables in METABRIC and immune cell clusters.
*Student's T-test, Pearson's Chi-squared test, or Fisher's exact test where appropriate.Significant values are in bold.

variable Immune cell subtype clusters p value* Cluster A, n = 1113 (83.2%) Cluster B, n = 224 (16.8%) Total n = 1337
Vol.:(0123456789) Scientific Reports | (2023) 13:18962 | https://doi.org/10.1038/s41598-023-45932-4www.nature.com/scientificreports/clinical and pathological variables of breast cancer, suggesting the importance of host immunity in determining tumor progression and host-tumor interaction.The machine learning-based cell cluster analyses split the tumor samples into large (83%) and small (17%) groups, which appears to match with the general trend of breast cancer outcome where most patients have a favorable prognosis (> 80%).Previously, Ali et al. performed hierarchical clustering analysis on immune cell subtypes in 10,988 tumor samples from 56 studies

Table 3 .
Associations between immune cell subtype cluster and breast cancer survival.*Adjusted for age, stage, grade, ER, PR, and histology.# Tumor grade not included in multivariate analysis.Significant values are in bold.